Missing data imputation - Following Julia Silge’s worked example
This is a dataset on captive Africans, and Julia Silge shows how to impute missing data in her worked example, which I am following closely. The link is here: https://juliasilge.com/blog/captive-africans-voyages/
More information about the dataset can be found in the TidyTuesday repository (2020-06-16).
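Since the setup chunk is not shown in the rendered post, these are the packages I assume are loaded for the code below (the two packages called with the :: prefix do not need to be attached):

library(tidyverse)   # readr, dplyr, tidyr, ggplot2
library(skimr)       # skim()
library(ggsci)       # scale_fill_jco(), scale_color_jco()
library(ggthemes)    # theme_clean()
library(GGally)      # ggpairs()
library(kableExtra)  # kbl()
library(naniar)      # gg_miss_upset()
library(tidymodels)  # recipes: recipe(), step_impute_median(), step_impute_knn()
# moments::skewness() and gvlma::gvlma() are called below with the :: prefix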
african_names_raw <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-16/african_names.csv")
glimpse(african_names_raw)
Rows: 91,490
Columns: 11
$ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
$ voyage_id <dbl> 2314, 2315, 2315, 2315, 2315, 2315, 2315, 231…
$ name <chr> "Bora", "Flam", "Dee", "Pao", "Mufa", "Latty"…
$ gender <chr> "Man", "Man", "Man", "Man", "Man", "Man", "Ma…
$ age <dbl> 30, 30, 28, 22, 16, 22, 20, 30, 18, 23, 30, 2…
$ height <dbl> 62.5, 64.0, 65.0, 62.5, 59.0, 67.5, 62.0, 65.…
$ ship_name <chr> "NS de Regla", "Fabiana", "Fabiana", "Fabiana…
$ year_arrival <dbl> 1819, 1819, 1819, 1819, 1819, 1819, 1819, 181…
$ port_disembark <chr> "Freetown", "Freetown", "Freetown", "Freetown…
$ port_embark <chr> "Little Bassa", "Trade Town", "Trade Town", "…
$ country_origin <chr> NA, NA, NA, "Crue", "Crue", NA, "Crue", "Crue…
This is a long dataset, with 91,490 rows and 11 columns.
skim(african_names_raw)
Name | african_names_raw |
Number of rows | 91490 |
Number of columns | 11 |
_______________________ | |
Column type frequency: | |
character | 6 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
name | 0 | 1.00 | 2 | 24 | 0 | 62330 | 0 |
gender | 12878 | 0.86 | 3 | 5 | 0 | 4 | 0 |
ship_name | 1 | 1.00 | 2 | 59 | 0 | 443 | 0 |
port_disembark | 0 | 1.00 | 6 | 19 | 0 | 5 | 0 |
port_embark | 1126 | 0.99 | 4 | 31 | 0 | 59 | 0 |
country_origin | 79404 | 0.13 | 3 | 31 | 0 | 563 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1.00 | 62122.02 | 51305.07 | 1.0 | 22935.25 | 45822.5 | 101263.8 | 199932 | ▇▆▃▁▂ |
voyage_id | 0 | 1.00 | 17698.25 | 82016.88 | 557.0 | 2443.00 | 2871.0 | 3601.0 | 500082 | ▇▁▁▁▁ |
age | 1126 | 0.99 | 18.89 | 8.60 | 0.5 | 11.00 | 20.0 | 26.0 | 77 | ▆▇▁▁▁ |
height | 4820 | 0.95 | 58.61 | 6.84 | 0.0 | 54.00 | 60.0 | 64.0 | 85 | ▁▁▂▇▁ |
year_arrival | 0 | 1.00 | 1831.40 | 9.52 | 1808.0 | 1826.00 | 1832.0 | 1837.0 | 1862 | ▂▆▇▃▁ |
There is a lot of missing data, especially for country_origin (about 87% missing), gender (14%), height (5%), and age and port_embark (about 1% each).
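As a quick cross-check of the skim() output, the missing values per column can also be tallied directly (just a convenience sketch):

# count missing values in each column and sort from most to least missing
african_names_raw %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  arrange(desc(n_missing))

Gender is examined first: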
african_names_raw %>%
count(gender)
# A tibble: 5 × 2
gender n
<chr> <int>
1 Boy 18896
2 Girl 12019
3 Man 33827
4 Woman 13870
5 <NA> 12878
african_names_raw %>%
select(gender, age) %>%
ggplot(aes(gender, age)) +
geom_boxplot(aes(fill = gender), alpha = 0.6, show.legend = F) +
scale_fill_jco() +
theme_clean()
african_names_raw %>%
count(ship_name) %>%
arrange(desc(n))
# A tibble: 444 × 2
ship_name n
<chr> <int>
1 Maria 1116
2 Carolina 1089
3 Emilia 886
4 Esperança 860
5 Vingador 845
6 Iberia 726
7 Águia 720
8 Málaga of Belouru 716
9 Josephina 712
10 Felicidade 680
# … with 434 more rows
african_names_raw %>%
count(port_disembark, sort = T) %>%
kbl()
port_disembark | n |
---|---|
Freetown | 81009 |
Havana | 10058 |
Bahamas unspecified | 183 |
Kingston, Jamaica | 144 |
St. Helena | 96 |
There are only 5 disembarkation ports, so port_disembark can be recoded as a factor (this is done in the wrangling step further below).
african_names_raw %>%
count(port_embark, sort = T) %>%
kbl()
port_embark | n |
---|---|
Lagos | 15730 |
Bonny | 14449 |
Ouidah | 8306 |
Old Calabar | 7192 |
Gallinhas | 4522 |
Bight of Benin unspecified | 3958 |
Rio Pongo | 2922 |
Badagry | 2729 |
Cameroons River | 2165 |
Sierra Leone unspecified | 2091 |
Loango | 1926 |
Bight of Biafra unspecified | 1856 |
Rio Brass | 1603 |
Popo | 1542 |
Sherbro | 1422 |
Rio Nun | 1405 |
New Calabar | 1324 |
Freetown | 1167 |
NA | 1126 |
Cabinda | 1124 |
Ambriz | 1101 |
Bimbia | 892 |
Cameroons, unspecified | 862 |
Congo North | 715 |
Bissau | 679 |
Anomabu | 627 |
Gabon | 624 |
West Central Africa unspecified | 597 |
Mozambique | 578 |
Jacquin | 540 |
Congo River | 524 |
Cap Lopez | 473 |
Cape Mount | 463 |
Porto Novo | 450 |
Luanda | 410 |
Keta | 388 |
Quicombo | 362 |
Gambia | 285 |
Grand Bassa | 249 |
Windward Coast unspecified | 249 |
Cacheu | 248 |
Grand Mesurado | 233 |
St. Paul | 231 |
Gold Coast unspecified | 209 |
Nova Redonda | 192 |
Little Bassa | 184 |
Sestos | 169 |
Mayumba | 94 |
Senegambia, unspecified | 83 |
Iles de Los | 69 |
Rio Nunez | 52 |
Ile Principé | 22 |
Bananas, Goree and Senegal | 17 |
Iles Plantain | 13 |
Trade Town | 13 |
Gorée | 11 |
Mano | 10 |
Oerê | 8 |
Petit Popo | 3 |
Corisco | 2 |
min(african_names_raw$year_arrival) # 1808
[1] 1808
max(african_names_raw$year_arrival) # 1862
[1] 1862
african_names_raw %>%
group_by(year_arrival) %>%
summarise(n = n()) %>%
ggplot(aes(year_arrival, n)) +
geom_col(col = "grey", fill = "deepskyblue4") +
scale_fill_jco() +
scale_x_continuous(limits = c(1805, 1865),
n.breaks =8) +
labs(title = "No. of arrivals over time, between years 1808-1862") +
theme_classic()
There seems to be a gap in the data between 1850 and 1860, so arrivals after 1850 will be filtered out.
african_names_raw %>%
filter(year_arrival>1850) %>%
count() %>%
mutate(pct = n/nrow(african_names_raw)*100)
# A tibble: 1 × 2
n pct
<int> <dbl>
1 96 0.105
Only about 0.1% of the data is discarded.
glimpse(african_names_raw)
Rows: 91,490
Columns: 11
$ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
$ voyage_id <dbl> 2314, 2315, 2315, 2315, 2315, 2315, 2315, 231…
$ name <chr> "Bora", "Flam", "Dee", "Pao", "Mufa", "Latty"…
$ gender <chr> "Man", "Man", "Man", "Man", "Man", "Man", "Ma…
$ age <dbl> 30, 30, 28, 22, 16, 22, 20, 30, 18, 23, 30, 2…
$ height <dbl> 62.5, 64.0, 65.0, 62.5, 59.0, 67.5, 62.0, 65.…
$ ship_name <chr> "NS de Regla", "Fabiana", "Fabiana", "Fabiana…
$ year_arrival <dbl> 1819, 1819, 1819, 1819, 1819, 1819, 1819, 181…
$ port_disembark <chr> "Freetown", "Freetown", "Freetown", "Freetown…
$ port_embark <chr> "Little Bassa", "Trade Town", "Trade Town", "…
$ country_origin <chr> NA, NA, NA, "Crue", "Crue", NA, "Crue", "Crue…
african_names_raw %>%
select(age, height) %>%
pivot_longer(everything()) %>%
ggplot(aes(name, value)) +
geom_boxplot() +
facet_grid(. ~ name, scales = "free") +
theme_classic()
african_names_raw %>%
select(age) %>%
ggplot(aes(age)) +
geom_freqpoly() +
labs(title = "Frequency distribution for Age") +
scale_y_continuous(labels = scales::comma_format()) +
theme_clean()
min(african_names_raw$age, na.rm = T)
[1] 0.5
max(african_names_raw$age, na.rm = T)
[1] 77
african_names_raw %>%
filter(age == 0.5)
# A tibble: 7 × 11
id voyage_id name gender age height ship_name year_arrival
<dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <dbl>
1 17082 3019 Bahboh Boy 0.5 22 Donna Ba… 1829
2 18294 3014 Ahbetah Boy 0.5 22 Hosse 1829
3 18947 3023 Yahjoh Boy 0.5 24 Emilia 1829
4 19421 3024 Dahmajae Girl 0.5 24 Santo Ia… 1829
5 103325 7627 Ballee Girl 0.5 NA S Miguel… 1812
6 104037 7508 Sagata Boy 0.5 NA Andorinha 1812
7 104138 7508 Ingoho Girl 0.5 NA Andorinha 1812
# … with 3 more variables: port_disembark <chr>, port_embark <chr>,
# country_origin <chr>
african_names_raw %>%
mutate(height_cm = height * 2.54) %>%
filter(age < 2) %>%
select(age, height_cm)
# A tibble: 128 × 2
age height_cm
<dbl> <dbl>
1 1.5 31.8
2 1 76.2
3 1 76.2
4 1 107.
5 1.5 96.5
6 1 61.0
7 1 45.7
8 1 61.0
9 1 61.0
10 1 34.3
# … with 118 more rows
There were even babies among those captured and sold as slaves.
african_names_raw %>%
count(height, sort = T)
# A tibble: 201 × 2
height n
<dbl> <int>
1 63 5446
2 64 5236
3 62 5207
4 65 4885
5 61 4843
6 NA 4820
7 59 4315
8 66 3937
9 60 3601
10 58 3181
# … with 191 more rows
african_names_raw %>%
mutate(height_cm = height * 2.54) %>%
select(height_cm) %>%
ggplot(aes(height_cm)) +
geom_freqpoly() +
labs(title = "Frequency distribution for Height") +
scale_y_continuous(labels = scales::comma_format()) +
theme_clean()
african_names <- african_names_raw %>%
filter(year_arrival <= 1850) %>%
mutate(gender_recode = factor(gender),
gender_recode = case_when(gender_recode == "Boy" ~ "Male",
gender_recode == "Man" ~ "Male",
gender_recode == "Girl" ~ "Female",
gender_recode == "Woman" ~ "Female")) %>%
mutate(port_disembark = factor(port_disembark)) %>%
mutate(height_cm = round(height * 2.54, 2)) %>%
select(-gender, -height) %>%
mutate(across(where(is.character), factor))
# check gender
african_names %>%
count(gender_recode)
# A tibble: 3 × 2
gender_recode n
<fct> <int>
1 Female 25889
2 Male 52723
3 <NA> 12782
# glimpse
glimpse(african_names)
Rows: 91,394
Columns: 11
$ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
$ voyage_id <dbl> 2314, 2315, 2315, 2315, 2315, 2315, 2315, 231…
$ name <fct> "Bora", "Flam", "Dee", "Pao", "Mufa", "Latty"…
$ age <dbl> 30, 30, 28, 22, 16, 22, 20, 30, 18, 23, 30, 2…
$ ship_name <fct> NS de Regla, Fabiana, Fabiana, Fabiana, Fabia…
$ year_arrival <dbl> 1819, 1819, 1819, 1819, 1819, 1819, 1819, 181…
$ port_disembark <fct> "Freetown", "Freetown", "Freetown", "Freetown…
$ port_embark <fct> "Little Bassa", "Trade Town", "Trade Town", "…
$ country_origin <fct> NA, NA, NA, "Crue", "Crue", NA, "Crue", "Crue…
$ gender_recode <fct> Male, Male, Male, Male, Male, Male, Male, Mal…
$ height_cm <dbl> 158.75, 162.56, 165.10, 158.75, 149.86, 171.4…
african_names %>%
select(age, height_cm, gender_recode) %>%
drop_na(gender_recode) %>%
ggpairs(aes(col = gender_recode)) +
scale_color_jco() +
scale_fill_jco() +
theme_classic()
african_names %>%
select(height_cm) %>%
ggplot(aes(height_cm)) +
geom_freqpoly() +
theme_classic()
# to confirm skewness using moments package
height_vector <- african_names %>%
select(height_cm) %>%
drop_na() %>% # remove na for calculation
as_vector() # for skewness calculation
moments::skewness(height_vector) #-0.767 : negatively skewed
[1] -0.7675827
There are many different approaches to interpreting skewness values. A common rule of thumb is that an absolute skewness below 0.5 indicates an approximately symmetric distribution, between 0.5 and 1 a moderately skewed one, and above 1 a highly skewed one. With a skewness of about -0.77, height_cm is therefore moderately negatively skewed.
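Another quick way to see the direction of the skew is to compare the mean and the median of height_cm (assuming the african_names data frame created above): for negatively skewed data the mean is pulled below the median, which is also why the median is used for imputation later on.

# mean vs median height: the mean sits below the median for a negative skew
african_names %>%
  summarise(mean_height   = mean(height_cm, na.rm = TRUE),
            median_height = median(height_cm, na.rm = TRUE))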
african_names %>%
select(age, height_cm, gender_recode) %>%
naniar::gg_miss_upset()
This plot shows that, of the three variables, gender accounts for most of the missing values, followed by height and then age.
For machine learning, the data should be split into training and test sets before imputation. In this example, however, Julia Silge focuses on statistical inference rather than prediction, so the data is only pre-processed to impute the missing values, without being split first (a sketch of the split-first workflow is given below).
The aim of the modelling was to find out how gender, age and height changed over the years. For me, though, the main purpose of this exercise was to become familiar with the concept of imputing missing data.
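For completeness, this is roughly what the split-first workflow would look like with rsample and recipes; it is a sketch only and is not used in the rest of this post:

# Sketch only: split first, fit the imputation on the training data,
# then apply the same (trained) steps to the test data.
set.seed(2021)
data_split <- initial_split(african_names, prop = 0.8)
train_data <- training(data_split)
test_data  <- testing(data_split)

ml_recipe <- recipe(year_arrival ~ gender_recode + age + height_cm,
                    data = train_data) %>%
  step_impute_median(height_cm) %>%
  step_impute_knn(all_predictors())

prepped   <- prep(ml_recipe, training = train_data)
train_imp <- bake(prepped, new_data = NULL)      # imputed training set
test_imp  <- bake(prepped, new_data = test_data) # same imputation applied to the test set

Returning to the approach used in this post, the imputation recipe is defined on the full dataset: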
impute_recipe <- recipe(year_arrival ~ gender_recode + age + height_cm,
data = african_names) %>%
step_impute_median(height_cm) %>% # impute height with the median, since it is negatively skewed
step_impute_knn(all_predictors()) # k-nearest-neighbour imputation for the remaining predictors (age, gender)
imputed_data <- impute_recipe %>%
prep() %>%
bake(new_data = NULL)
imputed_data %>%
skim()
Name | Piped data |
Number of rows | 91394 |
Number of columns | 4 |
_______________________ | |
Column type frequency: | |
factor | 1 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
gender_recode | 0 | 1 | FALSE | 2 | Mal: 60654, Fem: 30740 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
age | 0 | 1 | 18.84 | 8.58 | 0.5 | 11.00 | 19.0 | 26.00 | 77.0 | ▆▇▁▁▁ |
height_cm | 0 | 1 | 149.04 | 16.95 | 0.0 | 137.16 | 152.4 | 162.56 | 215.9 | ▁▁▂▇▁ |
year_arrival | 0 | 1 | 1831.37 | 9.48 | 1808.0 | 1826.00 | 1832.0 | 1837.00 | 1848.0 | ▂▂▇▇▃ |
Now, there is no more missing data.
A plain linear model (lm) is actually not suitable for the model defined here. Appropriate data transformations, such as dealing with the skewed variables, should be carried out before fitting a linear model, or a robust linear regression should be used instead.
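The chunk that fitted the model is not shown in the rendered post; based on the Call: lines in the output below, it would have looked roughly like this (the original appears to have named the model object lm, which masks stats::lm(), so lm_fit is used here):

# Reconstruction of the (unshown) model-fitting chunk
lm_fit <- lm(year_arrival ~ gender_recode + age + height_cm,
             data = imputed_data)
lm_fit                # coefficients shown below

gvlma::gvlma(lm_fit)  # global test of linear-model assumptions, output below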
Call:
lm(formula = year_arrival ~ gender_recode + age + height_cm,
data = imputed_data)
Coefficients:
(Intercept) gender_recodeMale age
1834.13827 0.35515 0.02380
height_cm
-0.02316
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = lm)
Value p-value Decision
Global Stat 29783.5 0 Assumptions NOT satisfied!
Skewness 1658.4 0 Assumptions NOT satisfied!
Kurtosis 627.1 0 Assumptions NOT satisfied!
Link Function 148.4 0 Assumptions NOT satisfied!
Heteroscedasticity 27349.6 0 Assumptions NOT satisfied!
The recommended order for data pre-processing steps with the recipes package is roughly the following (a sketch of a recipe laid out in this order is given after the list):
1. Impute
2. Handle factor levels
3. Individual transformations for skewness and other issues
4. Discretise (only if needed)
5. Create dummy variables
6. Create interactions
7. Normalisation steps (centre, scale, range, etc.)
8. Multivariate transformations (e.g. PCA, spatial sign)
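As an illustration only (the extra steps are not part of this analysis, and steps 4 and 6 are not needed here), a recipe whose steps follow that ordering could look like this:

# Sketch of a recipe laid out in the suggested order of steps
example_recipe <- recipe(year_arrival ~ gender_recode + age + height_cm,
                         data = african_names) %>%
  step_impute_median(height_cm) %>%         # 1. impute
  step_impute_knn(all_predictors()) %>%     # 1. impute (remaining predictors)
  step_novel(gender_recode) %>%             # 2. handle factor levels
  step_YeoJohnson(age, height_cm) %>%       # 3. individual transformations (skewness)
  step_dummy(all_nominal_predictors()) %>%  # 5. create dummy variables
  step_normalize(all_numeric_predictors())  # 7. normalisation (centre and scale)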
As mentioned earlier, for machine learning, data-preprocessing should be carried out on training data only.
- https://juliasilge.com/blog/captive-africans-voyages/
For attribution, please cite this work as
lruolin (2021, Nov. 17). pRactice corner: Captive Africans Dataset. Retrieved from https://lruolin.github.io/myBlog/posts/20211117 - Imputation for missing data (Julia Silge's tutorial)/
BibTeX citation
@misc{lruolin2021captive,
  author = {lruolin, },
  title = {pRactice corner: Captive Africans Dataset},
  url = {https://lruolin.github.io/myBlog/posts/20211117 - Imputation for missing data (Julia Silge's tutorial)/},
  year = {2021}
}